Abstract

A neural probabilistic language model (NPLM) offers a way to achieve better perplexity than n-gram
language models and their smoothed variants. This paper investigates an application area in
bilingual NLP, specifically Statistical Machine Translation (SMT). We focus on the perspective that
NPLM has the potential to complement 'resource-constrained' bilingual resources with potentially
'huge' monolingual resources. We introduce an ngram-HMM language model as an NPLM using a
non-parametric Bayesian construction. To facilitate application to various tasks, we propose a
joint space model of the ngram-HMM language model. We report an experiment on system combination in
SMT. One finding was that our treatment of noise improved the results by 0.20 BLEU points when the
NPLM is trained on a relatively small corpus, in our case 500,000 sentence pairs, which is often
the case due to the long training time of NPLM.

1 Introduction

A neural probabilistic language model (NPLM) [3] and distributed representations [25] offer a way
to achieve better perplexity than the n-gram language model [47] and its smoothed variants
[26, 9, 48]. Recently, the latter, i.e. smoothed language models, have seen many developments along
the line of nonparametric Bayesian methods, such as the hierarchical Pitman-Yor language model
(HPYLM) and the Sequence Memoizer (SM) [51, 20], including an application to SMT [36, 37, 38]. An
NPLM considers the representation of data in order to make the probability distribution of word
sequences more compact, focusing on the similar semantic and syntactic roles of words. For example,
given the two sentences "The cat is walking in the bedroom" and "A dog was running in a room",
these sentences can be stored more compactly than in an n-gram language model if we focus on the
similarity between (the, a), (bedroom, room), (is, was), and (running, walking). Thus, an NPLM
provides the semantic and syntactic roles of words as a language model. The NPLM of [3] implemented
this using a multi-layer neural network and yielded 20% to 35% better perplexity than the language
model with the modified Kneser-Ney methods [9].
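The compaction argument above can be made concrete with a toy sketch. The class table below is hand-written and purely hypothetical (an NPLM would learn such similarities rather than look them up, and the pair (cat, dog) is an added assumption beyond the pairs listed in the text); the point is only that both example sentences collapse to the same class sequence.

```python
# Hypothetical word classes; in an NPLM these similarities are learned, not hand-written.
WORD_CLASS = {
    "the": "DET", "a": "DET",
    "cat": "ANIMAL", "dog": "ANIMAL",   # assumption: not listed in the text
    "is": "BE", "was": "BE",
    "walking": "MOVE", "running": "MOVE",
    "in": "IN",
    "bedroom": "ROOM", "room": "ROOM",
}


def class_sequence(sentence):
    """Map each lower-cased token of a sentence to its class label."""
    return [WORD_CLASS[token] for token in sentence.lower().split()]


s1 = "The cat is walking in the bedroom"
s2 = "A dog was running in a room"
print(class_sequence(s1))
print(class_sequence(s1) == class_sequence(s2))
```

Both sentences share one class-level pattern, so a model that works over such classes needs to store that pattern only once, whereas an n-gram model stores the two word sequences separately.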



There are several successful applications of NPLM [11, 42, 10, 12, 14, 43]. First, one category of
applications includes POS tagging, NER tagging, and parsing [12]. This category uses the features
provided by an NPLM within a limited window size. In these tasks there are usually no long-range
effects, so a decision can be made within the limited window without looking carefully at elements
at a long distance. Second, another category of applications includes the Semantic Role Labeling
(SRL) task [12, 14]. This category uses features within a sentence. A typical element is the
predicate in an SRL task, which requires information that is sometimes at a long distance but still
within the sentence. Neither of these approaches requires obtaining the best tag sequence; the tags
are treated as independent. Third, the final category includes the MERT process [42] and possibly
many others, most of which remain undeveloped. The objective of learning in this category is not to
search for the best tag for a word but for the best sequence for a sentence. Hence, we need to
apply a sequential learning approach. Although most of the applications described in [10, 12, 14]
are monolingual tasks, applying this approach to a bilingual task automatically introduces
remarkable aspects, which we can call "creative words" [50], into the traditional
resource-constrained SMT components. For example, the training corpus of a word aligner is often
strictly restricted to the given parallel corpus. However, an NPLM allows this training with a huge
monolingual corpus. Although most of this line has not even been tested, mostly due to the
computational complexity of training an NPLM, [43] applied it to the MERT process, which reranks
the n-best lists using an NPLM. This paper aims at a different task, the task of system combination
[1, 29, 49, 15, 13, 35]. This category of tasks employs sequential methods such as Maximum A
Posteriori (MAP) inference (Viterbi decoding) [22, 44, 33] on Conditional Random Fields (CRFs) /
Markov Random Fields (MRFs).

Although this paper discusses an ngram-HMM language model, which we introduce as one model of NPLM
and which borrows many mechanisms from the infinite HMM [19] and the hierarchical Pitman-Yor LM
[48], one main contribution is to show a new application area of NPLM in SMT. Although several
applications of NPLM have been presented, to the best of our knowledge there has been no
application to the task of system combination.

The remainder of this paper is organized as follows. Section 2 describes the ngram-HMM language
model, while Section 3 introduces a joint space model of the ngram-HMM language model. Section 4
presents our intrinsic experimental results, and Section 5 presents our extrinsic experimental
results. We conclude in Section 6.



2 Ngram-HMM Language Model 



Generative model  Figure 1 depicts an example of the ngram-HMM language model, in this case a
4-gram-HMM language model, in blue (in the center). We consider a Hidden Markov Model (HMM)
[40, 21, 2] of size K which emits the n-gram word sequence $w_i, \ldots, w_{i-K+1}$, where
$h_i, \ldots, h_{i-K+1}$ denote the corresponding hidden states. The arcs from $w_{i-3}$ to $w_i$,
..., $w_{i-1}$ to $w_i$ show the back-off relations that appear in language model smoothing, such
as Kneser-Ney smoothing [26], Good-Turing smoothing [24], and hierarchical Pitman-Yor LM smoothing
[48].





Figure 1: Graphical representation of the 4-gram HMM language model.



On the left side of Figure 1, we place one Dirichlet Process prior $\mathrm{DP}(\alpha, H)$, with
concentration parameter $\alpha$ and base measure $H$, on the transition probabilities going out
from each hidden state. This construction is borrowed from the infinite HMM [2, 19]. The
observation likelihood for the hidden word $h_t$ is parameterized as
$w_t \mid h_t \sim F(\phi_{h_t})$, where $\phi$ denotes the output parameters, since the hidden
variables of an HMM are limited in their representation power. This is because the observations
can be regarded as being generated from a dynamic mixture model [19] as in (1), where the Dirichlet
priors on the rows have a shared parameter:

$$
\begin{aligned}
p(w_i \mid h_{i-1} = k) &= \sum_{h_i=1}^{K} p(h_i \mid h_{i-1} = k)\, p(w_i \mid h_i) \\
                        &= \sum_{h_i=1}^{K} \pi_{k,h_i}\, p(w_i \mid \phi_{h_i}) \qquad (1)
\end{aligned}
$$

On the right side of Figure 1, we place a Pitman-Yor prior PY, which has an advantage in its
power-law behavior since our target is NLP, as in (2):

$$ w_i \mid w_{1:i-1} \sim \mathrm{PY}(d_i, \theta_i, G_i) \qquad (2) $$

where $d_i$ is a discount parameter, $\theta_i$ is a strength parameter, and $G_i$ is a base
measure. This construction is borrowed from the hierarchical Pitman-Yor language model [48].

Inference  We compute the expected value of the posterior distribution of the hidden variables with
a beam sampler [19]. This blocked Gibbs sampler alternately samples the parameters (transition
matrix, output parameters), the state sequence, the hyper-parameters, and the parameters related to
language model smoothing. As mentioned in [19], this sampler is characterized by adaptively
truncating the state space and running dynamic programming as in (3):

$$ p(h_t \mid w_{1:t}, u_{1:t}) \propto p(w_t \mid h_t) \sum_{h_{t-1}:\, u_t < \pi_{h_{t-1},h_t}} p(h_{t-1} \mid w_{1:t-1}, u_{1:t-1}) \qquad (3) $$

where $u_t$ is only valid if it is smaller than the transition probabilities of the hidden word
sequence $h_1, \ldots, h_K$. Note that we use an auxiliary variable $u_i$, which is sampled for
each word in the sequence from the distribution $\mathrm{Uniform}(0, \pi_{h_{i-1},h_i})$. The
implementation of the beam sampler consists of preprocessing the transition matrix $\pi$ and
sorting its elements in descending order.
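As a rough illustration of the truncated dynamic programming in (3), the following Python sketch runs forward filtering with slice variables for a plain first-order HMM, not for the full ngram-HMM. The function names, the dense-list representation of $\pi$, and the toy parameters are assumptions made for illustration; they are not taken from the beam sampler implementation of [19].

```python
import random


def sample_slices(pi, states, rng):
    """Draw u_t ~ Uniform(0, pi[h_{t-1}][h_t]) for t >= 1 along the current state path."""
    return [rng.uniform(0.0, pi[states[t - 1]][states[t]])
            for t in range(1, len(states))]


def truncated_forward(pi, init, emit, words, slices):
    """Forward filtering as in (3): only transitions with pi[j][k] > u_t are summed.

    pi[j][k]  : transition probability from state j to state k
    init[k]   : initial state probability
    emit[k][w]: probability of emitting word id w from state k
    slices[t-1] holds the slice variable u_t for position t >= 1.
    """
    num_states = len(init)
    # Position 0 has no slice variable: use the initial distribution.
    message = [init[k] * emit[k][words[0]] for k in range(num_states)]
    total = sum(message)
    messages = [[m / total for m in message]]
    for t in range(1, len(words)):
        u = slices[t - 1]
        message = []
        for k in range(num_states):
            # Sum over predecessors j whose transition into k survives the slice.
            reachable = sum(messages[-1][j] for j in range(num_states) if pi[j][k] > u)
            message.append(emit[k][words[t]] * reachable)
        total = sum(message)
        messages.append([m / total for m in message])
    return messages


pi = [[0.7, 0.3], [0.4, 0.6]]
init = [0.5, 0.5]
emit = [[0.9, 0.1], [0.2, 0.8]]
messages = truncated_forward(pi, init, emit, words=[0, 1, 0], slices=[0.35, 0.5])
for message in messages:
    print(message)
```

Because each $u_t$ is drawn below the transition probability of the current path, that path always survives the truncation, so the normalising sums stay positive. Larger slice values prune more transitions, which is what keeps the number of states considered at each step finite.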

Initialization  First, we obtain the parameters of the hierarchical Pitman-Yor process-based
language model [48, 23], which can be obtained using block Gibbs sampling [32].

Second, to obtain a better initial value of h for the above inference, we run the following EM
algorithm instead of initializing the distribution of h randomly. This EM algorithm incorporates
the above-mentioned truncation [19]. In the E-step, we compute the expected value of the posterior
distribution of the hidden variables. For every position $h_i$, we send a forward message
$\alpha(h_{i-n+1:i-1})$ in a single path from the start to the end of the chain (this is the
standard forward recursion in an HMM; hence we use $\alpha$). Here we normalize the sum of $\alpha$
considering the truncated variables $u_{i-n+1:i-1}$:

$$ \alpha(h_{i-n+2:i}) = \frac{1}{\sum \alpha(u_{i-n+1:i-1})}\, P(w_i \mid h_i) \sum \alpha(u_{i-n+1:i-1})\, P(h_i \mid h_{i-n+1:i-1}) \qquad (4) $$

Then, for every position $h_j$, we send a message $\beta(h_{i-n+2:i}, h_j)$ in multiple paths from
the start to the end of the chain as in (5):

$$ \beta(h_{i-n+2:i}, h_j) = \frac{1}{\sum \alpha(u_{i-n+1:i-1})}\, P(w_i \mid h_i) \sum \beta(h_{i-n+1:i-1}, h_j)\, P(h_i \mid h_{i-n+1:i-1}) \qquad (5) $$

This step aims at obtaining the expected value of the posterior distribution (a similar use of
expectation can be seen in the factored HMM [22]). In the M-step, we use the expected value of the
posterior distribution obtained in the E-step to evaluate the expectation of the logarithm of the
complete-data likelihood.

3 Joint Space Model

In this paper, we mechanically introduce a joint space model. Besides the ngram-HMM language model
obtained in the previous section, we often encounter situations where we have another set of hidden
variables $h^1$ that is unrelated to $h^0$, as depicted in Figure 2. Suppose that the ngram-HMM
language model has yielded hidden variables suggesting the semantic and syntactic roles of words.
In addition, we may have other hidden variables suggesting, say, a genre ID. This genre ID can be
considered a second context, which is often not closely related to the first context. This
mechanical construction also has the advantage that the resulting language model often has lower
perplexity than the original ngram-HMM language model. Note that we do not intend to learn this
model jointly using a universal criterion; we simply concatenate the labels produced by different
tasks on the same sequence. With this formulation, we intend to facilitate the use of this language
model.




Figure 2: The joint space 4-gram HMM language model.

Note that these two contexts may not be derivable by a single learning algorithm. For example, a
language model with the sentence context may not be derived in the same way as one with the word
context. In the above example, the hidden semantics over a sentence is not a sequential object.
Hence, it can only be obtained by treating all sentences as independent. We can then obtain it
using, say, LDA.

4 Intrinsic Evaluation 

We compared the perplexity of the ngram-HMM LM (1 feature), the ngram-HMM LM (2 features, the same
as in this paper, with a 4-class genre ID), modified Kneser-Ney smoothing (irstlm) [18], and the
hierarchical Pitman-Yor LM [48]. We used the news2011 English test set and trained the LMs on
Europarl.

                  ngram-HMM (1 feat)   ngram-HMM (2 feat)   modified Kneser-Ney   hierarchical PY
Europarl 1500k    114.014              113.450              118.890               118.884

Table 1: Perplexity of each language model.
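For reference, perplexity is the exponentiated average negative log-probability per word, so lower is better. The sketch below states that definition and computes the relative reduction between two rows of Table 1; the helper names are hypothetical and the paper does not say which tool computed the reported values.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities: exp(-mean(log p))."""
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(baseline, system):
    """Fraction by which `system` lowers the perplexity of `baseline`."""
    return (baseline - system) / baseline


# Perplexities copied from Table 1.
kneser_ney = 118.890
hmm_one_feature = 114.014
hmm_two_features = 113.450

print(relative_reduction(kneser_ney, hmm_one_feature))
print(relative_reduction(kneser_ney, hmm_two_features))
```

On the values in Table 1 this gives relative reductions of roughly 4.1% (1 feature) and 4.6% (2 features) over modified Kneser-Ney.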



5 Extrinsic Evaluation: Task of System Combination 

We applied the ngram-HMM language model to the task of system combination. Given multiple Machine
Translation (MT) outputs, this task essentially combines the best fragments among them to create a
new MT output. The standard procedure consists of three steps: Minimum Bayes Risk decoding,
monolingual word alignment, and monotonic consensus decoding. Although these procedures themselves
need explanation in order to understand what follows, we keep the main text to a minimum and move
some (not exhaustive) explanations to the appendices. Note that although this experiment was done
using the ngram-HMM language model, any NPLM may be sufficient for this purpose. In this sense, we
use the term NPLM instead of ngram-HMM language model.

Features in Joint Space  The first feature of NPLM is the set of words with semantically and
syntactically similar roles, which can be derived from the original NPLM. We introduce the second
feature in this paragraph, which is a genre ID.

The motivation for this feature comes from studies of domain adaptation for SMT, where it has
become popular to consider the effect of genre in the test set. This paper uses Latent Dirichlet
Allocation (LDA) [5, 16, 6, 45, 33] to obtain the genre ID via (unsupervised) document
classification, since our interest here is in the genre of the sentences in the test set. We then
place these labels on a joint space.

LDA represents topics as multinomial distributions over the W unique word-types in the corpus and
represents documents as mixtures of topics. Let C be the number of unique labels in the corpus.
Each label c is represented by a W-dimensional multinomial distribution $\phi_c$ over the
vocabulary. For document d, we observe both the words in the document $w^{(d)}$ and the document
labels $c^{(d)}$. Given the distribution over topics $\theta_d$, the generation of words in the
document is captured by the following generative model. The parameters $\alpha$ and $\beta$ relate
to the corpus level, the variables $\theta_d$ belong to the document level, and the variables
$z_{dn}$ and $w_{dn}$ correspond to the word level; they are sampled once for each word in each
document.

Using topic modeling in the second step, we propose the following overall algorithm to obtain genre
IDs for the test set.

1. Fix the number of clusters C; we explore values from small to large, where the optimal value is
   searched on the tuning set.
2. Do unsupervised document classification (or LDA) on the source side of the tuning and test sets.
   (a) For each label $c \in \{1, \ldots, C\}$, sample a distribution over word-types
       $\phi_c \sim \mathrm{Dirichlet}(\cdot \mid \beta)$
   (b) For each document $d \in \{1, \ldots, D\}$
       i. Sample a distribution over its observed labels
          $\theta_d \sim \mathrm{Dirichlet}(\cdot \mid \alpha)$
       ii. For each word $i \in \{1, \ldots, N_d\}$
           A. Sample a label $z_i^{(d)} \sim \mathrm{Multinomial}(\theta_d)$
           B. Sample a word $w_i^{(d)} \sim \mathrm{Multinomial}(\phi_c)$ from the label
              $c = z_i^{(d)}$
3. Separate each class of the tuning and test sets (keep the original index and the new index in
   the allocated separated dataset).
4. (Run system combination on each class.)
5. (Reconstruct the system combined results of each class, preserving the original index.)
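The generative story in step 2 can be simulated directly. The sketch below only samples from the model (it does not perform the inference that assigns genre IDs to real sentences), and the function names, the symmetric Dirichlet parameters and the toy vocabulary are assumptions made for illustration.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Sample from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_labels, vocab, doc_lengths, alpha, beta, seed=0):
    """Follow steps 2(a) and 2(b): sample phi_c, then theta_d, z and w per document."""
    rng = random.Random(seed)
    # (a) one distribution over word-types per label c
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_labels)]
    corpus = []
    for length in doc_lengths:                      # (b) for each document d
        theta = sample_dirichlet(alpha, num_labels, rng)   # i. label distribution
        document = []
        for _ in range(length):                     # ii. for each word position
            z = rng.choices(range(num_labels), weights=theta)[0]   # A. sample a label
            w = rng.choices(vocab, weights=phi[z])[0]              # B. sample a word
            document.append((z, w))
        corpus.append(document)
    return phi, corpus


vocab = ["parliament", "vote", "patient", "dose", "goal", "match"]
phi, corpus = generate_corpus(num_labels=4, vocab=vocab, doc_lengths=[5, 3],
                              alpha=0.5, beta=0.5)
for document in corpus:
    print(document)
```

Here `num_labels=4` mirrors the 4-class genre ID used in Section 4; in the actual procedure C is tuned on the tuning set as described in step 1.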

Modified Process in System Combination  Given a joint space of NPLM, we need to specify which of
the three processes of system combination uses this NPLM. We only discuss here the standard system
combination using a confusion network. This strategy takes the following three steps (a very brief
explanation of these three is available in the Appendix):

• Minimum Bayes Risk decoding [28] (with the Minimum Error Rate Training (MERT) process [42])

$$
\begin{aligned}
\hat{E}^{\mathrm{MBR}}_{\mathrm{best}} &= \arg\min_{E' \in \mathcal{E}_H} R(E')
  = \arg\min_{E' \in \mathcal{E}_H} \sum_{E \in \mathcal{E}_E} L(E, E')\, P(E \mid F) \\
 &= \arg\min_{E' \in \mathcal{E}_H} \sum_{E \in \mathcal{E}_E} \bigl(1 - \mathrm{BLEU}_E(E')\bigr)\, P(E \mid F)
\end{aligned}
$$

• Monolingual word alignment

• (Monotone) consensus decoding (with the MERT process)

$$ E_{\mathrm{best}} = \arg\max_{e} \prod_{i=1}^{I} \phi(i \mid \bar{e}_i)\, p_{LM}(e) $$

Similar to the task of n-best reranking in the MERT process [43], we consider the reranking of
n-best lists in the third step above, i.e. (monotone) consensus decoding (with the MERT process).
We do not discuss the other two processes in this paper.
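The Minimum Bayes Risk step listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the gain function is a unigram F1 overlap used as a simple stand-in for sentence-level BLEU, the hypothesis and evidence spaces are taken to be the same list, and the posteriors are toy values. None of this reproduces the system combination module used in the experiments.

```python
from collections import Counter


def overlap_gain(evidence, hypothesis):
    """Unigram F1 between two token lists; a simple stand-in for sentence-level BLEU."""
    ref, hyp = Counter(evidence), Counter(hypothesis)
    matched = sum((ref & hyp).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(hypotheses, posteriors, gain=overlap_gain):
    """Return the hypothesis E' minimising sum_E (1 - gain(E, E')) * P(E|F)."""
    def risk(candidate):
        return sum((1.0 - gain(evidence, candidate)) * p
                   for evidence, p in zip(hypotheses, posteriors))
    return min(hypotheses, key=risk)


hypotheses = [
    ["john", "resigned", "yesterday"],
    ["yesterday", "john", "resigned"],
    ["john", "quit", "today"],
]
posteriors = [0.4, 0.3, 0.3]
print(mbr_decode(hypotheses, posteriors))
```

In system combination the selected hypothesis typically serves as the backbone against which the other MT outputs are aligned.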

On the one hand, we intend to use the first feature of NPLM, i.e. the semantically and
syntactically similar roles of words, for paraphrases. The n-best reranking in the MERT process
[43] alters the probability suggested by the word sense disambiguation task using the feature of
NPLM, while we intend to add a sentence in which words are replaced using NPLM. On the other hand,
we intend to use the second feature of NPLM, i.e. the genre ID, to split a single system
combination system into multiple system combination systems based on the genre ID clusters. From
this perspective, the roles of these two features can be seen as independent. We conducted the four
kinds of settings below.

(A) First Feature: N-Best Reranking in Monotonic Consensus Decoding without Noise (NPLM plain)  In
the first setting of the experiments, we used the first feature without considering noise. The
original aim of NPLM is to capture semantically and syntactically similar words in such a way that
a latent word depends on the context. We will be able to get a variety of words if we condition on
a fixed context, which would form paraphrases in theory.

We introduce our algorithm via a word sense disambiguation (WSD) task, which selects the right
disambiguated sense for the word in question. This task is necessary because a text is natively
ambiguous, accommodating several different meanings. The task of WSD [14] can be written as in (6):

$$ P(\mathrm{synset}_i \mid \mathrm{features}, \theta) = \frac{1}{Z(\mathrm{features})} \prod_{k} g(\mathrm{synset}_i, k)^{f(\mathrm{feature}_k)} \qquad (6) $$

where $k$ ranges over all possible features, $f(\mathrm{feature}_k)$ is an indicator function whose
value is 1 if the feature exists and 0 otherwise, $g(\mathrm{synset}_i, k)$ is a parameter for a
given synset and feature, $\theta$ is the collection of all these parameters
$g(\mathrm{synset}_i, k)$, and $Z$ is a normalization constant. Note that we use the term "synset"
by analogy with WordNet [30]: it is equivalent to "sense" or "meaning". Note also that NPLM is
included as one of the features in this equation. If the features include sufficient statistics,
the WSD task will succeed; otherwise, it will fail. We rerank the outcome of this WSD task.
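The log-linear form in (6) can be sketched in a few lines. The synset names, feature names and parameter values below are invented for illustration, and treating a missing parameter as 1 (no effect on the product) is an assumption of this sketch rather than something stated in the text.

```python
import math


def wsd_posterior(weights, active_features):
    """P(synset | features, theta) = (1/Z) * prod_k g(synset, k) ** f(feature_k).

    weights maps synset -> {feature: g > 0}. Only active features (f = 1) enter
    the product; a missing parameter is treated as g = 1 (assumption).
    """
    scores = {}
    for synset, g in weights.items():
        log_score = sum(math.log(g.get(feature, 1.0)) for feature in active_features)
        scores[synset] = math.exp(log_score)
    z = sum(scores.values())   # normalisation constant Z(features)
    return {synset: score / z for synset, score in scores.items()}


# Hypothetical parameters for two senses of "bank".
weights = {
    "bank/finance": {"money": 4.0, "river": 0.5},
    "bank/river": {"money": 0.5, "river": 4.0},
}
posterior = wsd_posterior(weights, active_features={"money"})
print(posterior)
```

With only the feature "money" active, the finance sense receives 4.0 / (4.0 + 0.5) of the probability mass; an NPLM-derived feature would enter the product in the same way as any other feature.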

On the one hand, the paraphrases obtained in this way have attractive aspects that can be called "a
creative word" [50]. This is because the traditional resources that can be used when building a
translation model in SMT are constrained to the parallel corpus, whereas NPLM can be trained on a
huge monolingual corpus. On the other hand, unfortunately, in practice the notorious training time
of NPLM only allows us to use a fairly small monolingual corpus, although many papers have made an
effort to reduce it [31]. Due to this, we cannot ignore the fact that an NPLM not trained on a huge
corpus may be affected by noise. Conversely, we have no guarantee that such noise will be reduced
if we train NPLM on a huge corpus. It is quite likely that NPLM has a lot of noise for small
corpora. Hence, this paper also needs to provide a way to overcome the difficulties of noisy data.
To avoid this difficulty, we accept a paraphrase only when it includes the original word itself
with high probability.

(B) First Feature: N-Best Reranking in Monotonic Consensus Decoding with Noise (NPLM dep)  In the
second setting of our experiment, we used the first feature while taking noise into account.
Although in the above algorithm we adopted each suggested paraphrase without any intervention, it
is also possible to examine whether such a suggestion should be adopted or not. If we add
paraphrases and the resulting sentence has a higher score in terms of the modified dependency score
[39] (see Figure 3), the addition of paraphrases is a good choice. If the resulting score
decreases, we do not need to add them. One difficulty in this approach is that we have no reference
that allows us to score it in the usual manner. For this reason, we deploy the above in a naive
way, using pseudo references. (This formulation is equivalent to decoding these inputs with MBR
decoding.) First, if we add paraphrases and the resulting sentence does not have a very bad score,
we add these paraphrases since they are not very bad (naive way). Second, we score the sentence in
question against all the other candidates (pseudo references) and calculate the average. Thus, our
second algorithm selects a paraphrase that does not achieve a very bad score in terms of the
modified dependency score using NPLM.
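The pseudo-reference scoring described above can be sketched as follows. The modified dependency score of [39] requires an LFG parser and is not reproduced here; a token-set Jaccard overlap is used as a clearly labelled stand-in, and the function names and the `tolerance` parameter are hypothetical.

```python
def token_jaccard(hypothesis, reference):
    """Stand-in similarity (assumption): Jaccard overlap of token sets, in [0, 1]."""
    hyp, ref = set(hypothesis), set(reference)
    union = hyp | ref
    return len(hyp & ref) / len(union) if union else 0.0


def pseudo_reference_score(sentence, candidates, score=token_jaccard):
    """Average score of `sentence` against all other candidates (pseudo references)."""
    return sum(score(sentence, reference) for reference in candidates) / len(candidates)


def keep_paraphrase(original, paraphrased, candidates, score=token_jaccard, tolerance=0.0):
    """Accept the paraphrased sentence unless its average score drops by more than `tolerance`."""
    before = pseudo_reference_score(original, candidates, score)
    after = pseudo_reference_score(paraphrased, candidates, score)
    return after >= before - tolerance


candidates = [["john", "quit", "yesterday"], ["yesterday", "john", "quit"]]
original = ["john", "resigned", "yesterday"]
print(keep_paraphrase(original, ["john", "quit", "yesterday"], candidates))
print(keep_paraphrase(original, ["mary", "slept", "today"], candidates))
```

A positive `tolerance` corresponds to the naive variant in which a paraphrase is kept as long as its score is not very bad, rather than strictly no worse than the original.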

(C) Second Feature: Genre ID (DA, Domain Adaptation)  In the third setting of our experiment, we
used only the second feature. As mentioned in the explanation of this feature, we intend to split a
single module of system combination into multiple modules of system combination according to the
genre ID. Hence, we use the module of system combination tuned for the specific genre ID.¹

[Figure 3 (image not recoverable): the c-structures of "John resigned yesterday" and "Yesterday
John resigned" differ, while both map to the same f-structure: SUBJ [PRED john, NUM sg, PERS 3],
PRED resign, TENSE past, ADJ {[PRED yesterday]}.]

Figure 3: Under the modified dependency score [39], the two sentences "John resigned yesterday" and
"Yesterday John resigned" receive the same score. The figure shows the c-structure and f-structure
of the two sentences using Lexical Functional Grammar (LFG) [8].

(D) First and Second Feature (COMBINED)  In the fourth setting, we used both features. In this
setting, (1) we used modules of system combination tuned for the specific genre ID, and (2) we
prepared an NPLM whose context can be switched based on the specific genre of the sentence in the
test set. The latter was straightforward since these two features are stored in a joint space in
our case.
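The per-genre routing used in settings (C) and (D), i.e. splitting the data by genre ID, running a genre-specific module, and restoring the original order, can be sketched as follows. The combiner callables are placeholders standing in for genre-tuned system combination modules; their names and behaviour are assumptions for illustration only.

```python
from collections import defaultdict


def split_by_genre(sentences, genre_ids):
    """Group sentences by genre ID, remembering each sentence's original index."""
    buckets = defaultdict(list)
    for index, (sentence, genre) in enumerate(zip(sentences, genre_ids)):
        buckets[genre].append((index, sentence))
    return dict(buckets)


def combine_per_genre(sentences, genre_ids, combiners):
    """Run the genre-specific combiner on each bucket and restore the original order."""
    output = [None] * len(sentences)
    for genre, items in split_by_genre(sentences, genre_ids).items():
        results = combiners[genre]([sentence for _, sentence in items])
        for (index, _), result in zip(items, results):
            output[index] = result
    return output


# Placeholder "modules": each takes a batch of sentences and returns one output per input.
combiners = {
    0: lambda batch: [s.upper() for s in batch],
    1: lambda batch: [s + "!" for s in batch],
}
print(combine_per_genre(["a", "b", "c", "d"], [0, 1, 0, 1], combiners))
```

Keeping the original index alongside each sentence is what allows the per-class outputs to be merged back into a single test-set output.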

Experimental Results  ML4HMT-2012 provides four translation outputs (s1 to s4), which are the MT
outputs of two RBMT systems, Apertium and Lucy, PB-SMT (Moses) and HPB-SMT (Moses), respectively.
The tuning data consists of 20,000 sentence pairs, while the test data consists of 3,003 sentence
pairs.

Our experimental setting is as follows. We use our system combination module [16, 17, 35], which
has its own language modeling tool, MERT process, and MBR decoding. We use the BLEU metric as the
loss function in MBR decoding. We use TERp² as the alignment metric in monolingual word alignment.
We trained NPLM using 500,000 sentence pairs from the English side of the EN-ES corpus of
Europarl³.

The results show that the first setting of NPLM-based paraphrase augmentation, NPLM plain, achieved
25.61 BLEU points, which is 0.39 BLEU points absolute below the standard system combination. The
second setting, NPLM dep, achieved slightly better results of 25.81 BLEU points, which is 0.19 BLEU
points absolute below the standard system combination. Note that the baseline achieved 26.00 BLEU
points, the best single system in terms of BLEU was s4 with 25.31 BLEU points, and the best single
system in terms of METEOR was s2 with 0.5853. The third setting achieved 26.33 BLEU points, which
was the best among our four settings. The fourth setting achieved 25.95 BLEU points, which is again
0.05 BLEU points below the standard system combination.
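The differences quoted above follow directly from the BLEU column of Table 2. The short check below recomputes them; the dictionary keys are shorthand introduced here, and the values are the table entries (fractions) converted to BLEU points.

```python
# BLEU values copied from Table 2 (as fractions); "baseline" is the standard
# system combination with the BLEU loss function.
bleu = {"baseline": 0.2600, "A": 0.2561, "B": 0.2581, "C": 0.2633, "D": 0.2595}


def delta_points(system, reference="baseline"):
    """Difference in BLEU points (x100) between two rows, rounded to two decimals."""
    return round((bleu[system] - bleu[reference]) * 100, 2)


for name in ("A", "B", "C", "D"):
    print(name, delta_points(name))
print("B vs A", delta_points("B", "A"))
```

The last line corresponds to the gain attributed to the noise treatment, i.e. the difference between NPLM dep and NPLM plain.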

Besides our four settings, which differ in the features they use, we ran several other settings of
system combination in order to understand the performance of the four settings: standard system
combination using the BLEU loss function (line 5 in Table 2), standard system combination using the
TER loss function (line 6), system combination whose backbone is unanimously taken from the RBMT
outputs (MT input s2 in this case; line 11), and system combination whose backbone is selected by
the modified dependency score (which has three variations in the table: modDep precision, recall
and Fscore; lines 12, 13 and 14). One interesting characteristic is that the s2 backbone (line 11)
achieved the best BLEU score among all of these variations. The modified dependency
measure-selected backbones follow. From these runs, we cannot say that the runs related to NPLM,
i.e. (A), (B) and (D), were particularly successful. A possible reason is that our interface with
NPLM was limited to paraphrases, which were not chosen very successfully by reranking.

¹ E.g., we translate newswire with the system combination module tuned on a newswire tuning set,
while we translate medical text with the system combination module tuned on a medical text tuning
set.
² http://www.cs.umd.edu/~snover/terp
³ http://www.statmt.org/europarl





                                      NIST     BLEU     METEOR      WER       PER
MT input s1                           6.4996   0.2248   0.5458641   64.2452   49.9806
MT input s2                           6.9281   0.2500   0.5853446   62.9194   48.0065
MT input s3                           7.4022   0.2446   0.5544660   58.0752   44.0221
MT input s4                           7.2100   0.2531   0.5596933   59.3930   44.5230
standard system combination (BLEU)    7.6846   0.2600   0.5643944   56.2368   41.5399
standard system combination (TER)     7.6231   0.2638   0.5652795   56.3967   41.6092
(A) NPLM plain                        7.6041   0.2561   0.5593901   56.4620   41.8076
(B) NPLM dep                          7.6213   0.2581   0.5601121   56.1334   41.7820
(C) DA                                7.7146   0.2633   0.5647685   55.8612   41.7264
(D) COMBINED                          7.6464   0.2595   0.5610121   56.0101   41.7702
s2 backbone                           7.6371   0.2648   0.5606801   56.0077   42.0075
modDep precision                      7.6670   0.2636   0.5659757   56.4393   41.4986
modDep recall                         7.6695   0.2642   0.5664320   56.5059   41.5013
modDep Fscore                         7.6695   0.2642   0.5664320   56.5059   41.5013

Table 2: Single best performance, the performance of the standard system combination (BLEU and TER
loss functions), the performance of the four settings in this paper ((A), ..., (D)), the
performance of the s2-backboned system combination, and the performance of the selection of
sentences by the modified dependency score (precision, recall, and F-score each).



6 Conclusion and Perspectives

This paper proposes a non-parametric Bayesian way to interpret NPLM, which we call the ngram-HMM
language model. We then add a small extension by concatenating another context in the same model,
which we call a joint space ngram-HMM language model. The main issue investigated in this paper was
an application of NPLM in bilingual NLP, specifically Statistical Machine Translation (SMT). We
focused on the perspective that NPLM has the potential to complement 'resource-constrained'
bilingual resources with potentially 'huge' monolingual resources. We compared our proposed
algorithms with others. One finding was that when we use a fairly small NPLM, noise reduction may
be one way to improve quality. In our case, the noise-reduced version scored 0.2 BLEU points
better.

Further work would be to apply this NPLM to various other tasks in SMT: word alignment,
hierarchical phrase-based decoding, and semantics-incorporated MT systems, in order to discover the
merit of 'depth' of architecture in Machine Learning.